Primary Data Encoding of a Bilingual Corpus

Authors

  • Johann GAMPER
  • Paolo DONGILLI
Abstract

This paper discusses the building of a bilingual corpus of legal and administrative texts, focusing on the encoding of documentation and structural information according to the Corpus Encoding Standard. The corpus is one module in an ongoing research project about (semi-)automatic terminology acquisition at the European Academy Bolzano and will serve as a basis for applying term extraction programs. We will discuss the pieces of information to be annotated as well as lessons learned during this process.

Similar resources

Encoding Hierarchical Bilingual Texts of Hong Kong Laws with XCES

This paper presents our recent work on encoding the hierarchical English-Chinese bilingual BLIS corpus of HK laws following XCES. Our work is aimed not only at facilitating data capture for language engineering such as machine translation but also at laying down a foundation for a suitable presentation of legislative texts for various purposes (e.g., rendering the loose-leaf edition of HK laws ...

Lexical semantic typologies from bilingual corpora - A framework

We present a framework, based on Sejane and Eger (2012), for inducing lexical semantic typologies for groups of languages. Our framework rests on lexical semantic association networks derived from encoding, via bilingual corpora, each language in a common reference language, the tertium comparationis, so that distances between languages can easily be determined.

Combining Corpus and Machine-Readable Dictionary Data for Building Bilingual

This paper describes and discusses some theoretical and practical problems arising from developing a system to combine the structured but incomplete information from machine readable dictionaries (MRDs) with the unstructured but more complete information available in corpora for the creation of a bilingual lexical data base, presenting a methodology to integrate information from both sources in...

A Study of the Feasibility of Designing and Developing a Bilingual (Farsi and Lori) Curriculum in Primary Education in Lorestan Province from the Viewpoint of Teachers and Principals

The purpose of this study is to survey the feasibility of designing and providing a bilingual curriculum (Farsi and Lori) from the viewpoint of principals and teachers in primary schools in Lorestan. It is a descriptive survey. The statistical population includes all principals and teachers of primary schools in Khorramabad, Nourabad, and Koohdasht in the 2010-2011 academic year. Sampling was done by simple random sampling meth...

Using RBMT Systems to Produce Bilingual Corpus for SMT

This paper proposes a method that uses an existing Rule-based Machine Translation (RBMT) system as a black box to produce a synthetic bilingual corpus, which is then used as training data for a Statistical Machine Translation (SMT) system. We use the existing RBMT system to translate a monolingual corpus into a synthetic bilingual corpus. With the synthetic bilingual corpus, we can build an SMT sy...

Publication date: 1999